Scripting for large-scale sequencing based on Hadoop

Authors

  • André Schumacher
  • Luca Pireddu
  • Aleksi Kallio
  • Matti Niemenmaa
  • Eija Korpelainen
  • Gianluigi Zanetti
  • Keijo Heljanko
Abstract

Motivation and Objectives: The large volumes of data generated by modern sequencing experiments present significant challenges for manipulation and analysis. Traditional approaches, such as scripting and relational database queries, are often inadequate, frustratingly slow, or complicated to scale. The "big data revolution" in data-based activities has already faced these problems, resulting in novel computational paradigms such as MapReduce and scalable tools such as Hadoop and Pig. We describe our ongoing work on SeqPig, a tool that facilitates the use of the Pig Latin scripting language to manipulate, analyze and query sequencing data. SeqPig provides access to popular data formats and implements a number of high-level functions. Most importantly, it gives users access to Hadoop, a platform of proven scalability, from a high-level scripting language, whether the cluster is run locally or in the cloud.
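
As a rough illustration of the MapReduce paradigm mentioned in the abstract, the following minimal sketch counts bases across a few sequencing reads in plain Python. It does not use SeqPig, Pig Latin, or Hadoop, and it is not the authors' implementation; the reads and the names map_read and reduce_counts are hypothetical and chosen only for this example.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy reads; real input would come from sequencing data files.
reads = ["ACGT", "GGCA", "TTAC"]


def map_read(read):
    """Map step: turn one read into a partial count of its bases."""
    return Counter(read)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Each read is mapped independently, then the partial counts are merged.
base_counts = reduce(reduce_counts, map(map_read, reads), Counter())
print(dict(base_counts))
```

Because each read is mapped independently and the merge is associative, a framework such as Hadoop can spread both steps over many machines; this sketch runs them in a single process.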

Similar resources

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

SUMMARY Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query se...


Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...


Exploring Non-Homogeneity and Dynamicity of High Scale Cloud through Hive and Pig

The trace consists of cell information of about 29 days spanning across 700k jobs. This paper deals with statistical analysis of this cluster trace. Since the trace is very large, Hive, a Hadoop distributed file system (HDFS) based platform for querying and analysis of big data, has been used. Hive was accessed through its Beeswax interface. The data was imported into HDFS throu...


BioPig: a Hadoop-based analytic toolkit for large-scale sequence data

MOTIVATION The recent revolution in sequencing technologies has led to an exponential growth of sequence data. As a result, most of the current bioinformatics tools become obsolete as they fail to scale with data. To tackle this 'data deluge', here we introduce the BioPig sequence analysis toolkit as one of the solutions that scale to data and computation. RESULTS We built BioPig on the Apach...


Hadoop-BAM: directly manipulating next generation sequencing data in the cloud

Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can ...

Publication date: 2013